Method for Detecting Talking Heads in a Compressed Video
Field of the Invention 

The present invention relates generally to extracting motion activity from a 
compressed video, and more particularly, to identifying talking heads or principal 
cast in a compressed video. 

Background of the Invention 

Compressed Video Formats 

Basic standards for compressing the bandwidth of digital color video signals have
been adopted by the Moving Picture Experts Group (MPEG). The MPEG standards
achieve high data compression rates by developing information for a full frame of
the image only every so often. The full image frames, i.e. intra-coded frames, are
often referred to as "I-frames" or "anchor frames," and contain full frame
information independent of any other frames. Image difference frames, i.e.
inter-coded frames, are often referred to as "B-frames" and "P-frames," or as
"predictive frames," and are encoded between the I-frames and reflect only image
differences, i.e. residues, with respect to the reference frame.

Typically, each frame of a video sequence is partitioned into smaller blocks of 
picture element, i.e. pixel, data. Each block is subjected to a discrete cosine 
transformation (DCT) function to convert the statistically dependent spatial domain 
pixels into independent frequency domain DCT coefficients. Respective 8x8 or 
16x16 blocks of pixels, referred to as "macro-blocks," are subjected to the DCT 
function to provide the coded signal.

The DCT coefficients are usually energy concentrated so that only a few of the
coefficients in a macro-block contain the main part of the picture information. For
example, if a macro-block contains an edge boundary of an object, the energy in
that block after transformation, i.e., as represented by the DCT coefficients,
includes a relatively large DC coefficient and randomly distributed AC coefficients
throughout the matrix of coefficients.

A non-edge macro-block, on the other hand, is usually characterized by a similarly
large DC coefficient and a few adjacent AC coefficients which are substantially
larger than other coefficients associated with that block. The DCT coefficients are
typically subjected to adaptive quantization, and then are run-length and
variable-length encoded for the transmission medium. Thus, the macro-blocks of
transmitted data typically include fewer than an 8 x 8 matrix of codewords.

The macro-blocks of inter-coded frame data, i.e. encoded P or B frame data,
include DCT coefficients which represent only the differences between the predicted
pixels and the actual pixels in the macro-block. Macro-blocks of intra-coded and
inter-coded frame data also include information such as the level of quantization
employed, a macro-block address or location indicator, and a macro-block type.
The latter information is often referred to as "header" or "overhead" information.

Each P frame is predicted from the most recently occurring I or P frame. Each B
frame is predicted from the I or P frames between which it is disposed. The
predictive coding process involves generating displacement vectors, often referred
to as "motion vectors," which indicate the magnitude of the displacement to the
macro-block of an I frame that most closely matches the macro-block of the B or P
frame currently being coded. The pixel data of the matched block in the I frame is
subtracted, on a pixel-by-pixel basis, from the block of the P or B frame being
encoded, to develop the residues. The transformed residues and the vectors form
part of the encoded data for the P and B frames.
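The prediction step described above can be sketched as an exhaustive
block-matching search. This is only an illustration under stated assumptions: the
function name `match_block`, the sum-of-absolute-differences matching criterion,
and the small search range are choices made here, not details taken from the MPEG
standards or from this description.

```python
def match_block(ref, cur, top, left, size=2, search=1):
    """Find the displacement (dy, dx) into `ref` that best predicts the
    size x size block of `cur` at (top, left), and return it together with
    the pixel-by-pixel residue (current block minus matched block)."""
    rows, cols = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            r0, c0 = top + dy, left + dx
            if r0 < 0 or c0 < 0 or r0 + size > rows or c0 + size > cols:
                continue  # candidate block falls outside the reference frame
            # Sum of absolute differences: an assumed matching criterion.
            sad = sum(abs(cur[top + r][left + c] - ref[r0 + r][c0 + c])
                      for r in range(size) for c in range(size))
            if best is None or sad < best[0]:
                best = (sad, dy, dx)
    _, dy, dx = best
    residue = [[cur[top + r][left + c] - ref[top + dy + r][left + dx + c]
                for c in range(size)] for r in range(size)]
    return (dy, dx), residue
```

The returned displacement plays the role of the motion vector, and the residue is
what would subsequently be transformed and encoded.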


Older video standards, such as ISO MPEG-1 and MPEG-2, are relatively low-level
specifications primarily dealing with temporal and spatial compression of video
signals. With these standards, one can achieve high compression ratios over a wide
range of applications. Newer video coding standards, such as MPEG-4, see
"Information Technology -- Generic coding of audio/visual objects," ISO/IEC
FDIS 14496-2 (MPEG4 Visual), Nov. 1998, allow arbitrary-shaped objects to be
encoded and decoded as separate video object planes (VOP). These emerging
standards are intended to enable multimedia applications, such as interactive video,
where natural and synthetic materials are integrated, and where access is universal.
For example, one might want to extract features from a particular type of video
object, or to perform for a particular class of video objects.


With the advent of new digital video services, such as video distribution on the
Internet, there is an increasing need for signal processing techniques for
identifying information in video sequences, either at the frame or object level, for
example, identification of activity.

Feature Extraction 

Previous work in feature extraction for identification and indexing from
compressed video has primarily emphasized DC coefficient extraction. In a paper
entitled "Rapid Scene Analysis on Compressed Video," IEEE Transactions on
Circuits and Systems for Video Technology, Vol. 5, No. 6, December 1995, pages
533-544, Yeo and Liu describe an approach to scene change detection in the
MPEG-2 compressed video domain. The authors also review earlier efforts at
detecting scene changes based on sequences of entire uncompressed image data,
and various compressed video processing techniques of others. Yeo and Liu
introduced the use of spatially reduced versions of the original images, so-called
DC images, and DC sequences extracted from compressed video to facilitate scene
analysis operations. Their "DC image" is made up of pixels which are the average
value of the pixels in a block of the original image and the DC sequence is the
combination of the reduced number of pixels of the DC image. It should be noted
that the DC image extraction based technique is good for I-frames since the
extraction of the DC values from I-frames is relatively simple. However, for other
type frames, additional computation is needed.
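What a DC image contains can be illustrated with a short sketch that averages
each block of an uncompressed frame. In an I-frame the DC coefficient of a block
is proportional to the mean of its pixels (up to quantization), so a
compressed-domain implementation would read these values from the bit-stream
rather than compute them from pixels; the function name `dc_image` and the
list-of-rows frame layout are assumptions made for this illustration.

```python
def dc_image(frame, block=8):
    """Return the spatially reduced image whose pixels are the mean of each
    block x block tile of `frame` (a list of equal-length rows).
    Partial tiles at the right and bottom edges are ignored."""
    rows, cols = len(frame), len(frame[0])
    reduced = []
    for top in range(0, rows - block + 1, block):
        row = []
        for left in range(0, cols - block + 1, block):
            total = sum(frame[top + r][left + c]
                        for r in range(block) for c in range(block))
            row.append(total / (block * block))
        reduced.append(row)
    return reduced
```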

Won et al., in a paper published in Proc. SPIE Conf. on Storage and Retrieval for
Image and Video Databases, January 1998, describe a method of extracting
features from compressed MPEG-2 video by making use of the bits expended on
the DC coefficients to locate edges in the frames. However, their work is limited to
I-frames only. Kobla et al. describe a method in the same Proceedings using the DC
image extraction of Yeo et al. to form video trails that characterize the video clips.


Feng et al. (IEEE International Conference on Image Processing, Vol. II, pp. 821-
824, Sept. 16-19, 1996), use the bit allocation across the macro-blocks of MPEG-2
frames to detect shot boundaries, without extracting DC images. Feng et al.'s
technique is computationally the simplest since it does not require significant
computation beyond that required for parsing the compressed bit-stream.

U.S. Patent Applications entitled "Methods of scene change detection and fade
detection for indexing of video sequences," (Application Sn. 09/231,698, filed
January 14, 1999), "Methods of scene fade detection for indexing of video
sequences," (Application Sn. 09/231,699, filed January 14, 1999), and
"Methods of Feature Extraction for Video Sequences," (Application Sn.
09/236,838, filed January 25, 1999), describe computationally simple techniques
which build on certain aspects of Feng et al.'s approach and Yeo et al.'s approach
to give accurate and simple scene change detection.

After a suspected scene or object change has been accurately located in a group of 
consecutive frames by use of a DC image extraction based technique, application 
of an appropriate bit allocation-based technique and/or an appropriate DC residual 
coefficient processing technique to P or B-frame information in the vicinity of the 
located scene quickly and accurately locates the cut point. This combined method 
is applicable to either MPEG-2 frame sequences or MPEG-4 multiple object 
sequences. In the MPEG-4 case, it is advantageous to use a weighted sum of the 
change in each object of the frame, using the area of each object as the weighting 
factor. Locating scene changes is useful for segmenting a video into shots. 
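The area-weighted combination mentioned for the MPEG-4 case can be sketched as
follows. The function name `weighted_scene_change` is hypothetical, and the
normalization by the total area is an assumption made here so that the result
stays on the same scale as the per-object change values; the text above only
states that object area is the weighting factor.

```python
def weighted_scene_change(changes, areas):
    """Combine per-object change measures into one frame-level value,
    weighting each object by its share of the total object area."""
    total_area = sum(areas)
    if total_area == 0:
        return 0.0  # no objects with area, so no measurable change
    return sum(c * a for c, a in zip(changes, areas)) / total_area
```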

U.S. Patent Application Sn. 09/345,452 entitled "Compressed Bit-Stream Segment 
Identification and Descriptor," filed by Divakaran et al. on July 1, 1999 describes a 
technique where magnitudes of displacements of inter-coded frames are 
determined based on the number of bits in the compressed bit-stream associated with
the inter-coded frames. The inter-coded frame includes macro-blocks. Each macro- 
block is associated with a respective portion of the inter-coded frame bits which 
represent the displacement from that macro-block to the closest matching intra- 
coded frame. The displacement magnitude is an average of the displacement 



MH-5089 
Divakaran et al. 

magnitudes of all the macro-blocks associated with the inter-coded frame. The 
displacement magnitudes of those macro-blocks which are less than the average 
displacement magnitude are set to zero. The number of run-lengths of zero 
magnitude displacement macro-blocks is determined to identify the first
inter-coded frame.

Motion Activity 

Prior art motion activity work has mainly focused on extracting motion activity
and using the motion activity for low level applications such as detecting scene or
shot changes, see U.S. Patent Application 09/236,838 "Methods of Feature
Extraction of Video," filed by Divakaran et al. on January 25, 1999, incorporated
herein by reference.


Motion activity can also be used to gauge the general motion activity and the
spatial distribution of motion activity in video shots. Such descriptors have been
successful in video browsing applications by filtering out all the high action shots
from low action shots, see United States Patent 5,552,832 "Run-length encoding
sequence for video signals," issued to Astle on September 3, 1996. The strength of
such descriptors lies in their computational simplicity.


It is desired to rapidly identify segments or shots of a video that include talking
heads, and those shots that do not. Using motion activity, in the compressed
domain, could speed up segmenting and indexing of reduced size videos for more
sophisticated detection of talking heads, see for example, Y. Wang, Z. Liu and J-C.
Huang, "Multimedia Content Analysis," IEEE Signal Processing Magazine,
November 2000. Prior art talking head detection has been mainly focused on
detecting colors, e.g., flesh, or detecting faces, which requires complex operations.

Summary of the Invention

The invention provides a method for identifying frames in a compressed video that
include "principal cast" or other "talking heads." Then, the video can be rapidly
segmented, and computationally more expensive face detection and recognition
processes can be employed on just the frames of the reduced size video.

The invention uses a template obtained from the centroid of a ground truth set of
features. Alternatively, multiple clustered templates can also be used. The feature
vectors of the templates can be modeled using a Gaussian mixture model (GMM)
applied to training data.

More particularly, the invention provides a method for identifying a talking head or
principal cast in a compressed video. The video is first segmented into shots. Then,
motion activity descriptors are extracted from each of the shots, and combined into
a shot motion activity descriptor. A distance between the shot motion activity
descriptor and a template motion activity descriptor is measured. The template
motion activity descriptor is obtained from a training video. If the measured
distance is less than a predetermined threshold, then the shot is identified as
including a talking head.

Brief Description of the Drawings

Figure 1 is a block diagram of an activity descriptor according to the invention; 

Figure 2 is a flow diagram of a method for extracting the activity descriptor from 
the magnitudes of motion vectors of a frame; and 

Figure 3 is a flow diagram of a method for identifying talking heads in a 
compressed video according to the invention.

Detailed Description of the Preferred Embodiment 
Motion Activity Descriptor 

Figure 1 shows an activity descriptor 100 that is used to detect talking heads in a
compressed video 102, according to the invention. The video 102 includes
sequences of frames ($f_0, \ldots, f_n$) that form "shots" 103. Hereinafter, a shot,
scene, or a segment of the video 102 means a set of frames that have some temporal
cohesiveness, for example, all frames taken between a lens opening and closing.
The invention analyzes spatial, temporal, directional, and intensity
information in the video 102.

The spatial information expresses the size and number of moving regions in the
shot on a frame by frame basis. The spatial information distinguishes between
"sparse" shots with a small number of large moving regions, for example, a
"talking head," and a "dense" shot with many small moving regions, for example,
a football game. Therefore, a sparse level of activity shot is said to have a small
number of large moving regions, and a dense level of activity shot is said to have a
large number of small moving regions.

The distribution of the temporal information expresses the duration of each level of
activity in the shot. The temporal information is an extension of the intensity of
motion activity in a temporal dimension. The direction information expresses the
dominant direction of the motion in a set of eight equally spaced directions. The
direction information can be extracted from the average angle (direction) of the
motion vectors in the video.
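The direction attribute can be illustrated with a minimal sketch that quantizes
the average direction of a set of motion vectors into eight equally spaced bins.
Several details are assumptions made for the example: the function name
`dominant_direction`, taking the average as the angle of the summed vectors,
numbering the bins counter-clockwise from the positive x axis, and returning
`None` when the vectors cancel out.

```python
import math

def dominant_direction(vectors, bins=8):
    """Quantize the average direction of the motion vectors into one of
    `bins` equally spaced directions (bin 0 is centered on the +x axis)."""
    sx = sum(x for x, y in vectors)
    sy = sum(y for x, y in vectors)
    if sx == 0 and sy == 0:
        return None  # no net motion, so no dominant direction
    angle = math.atan2(sy, sx) % (2 * math.pi)   # map to [0, 2*pi)
    width = 2 * math.pi / bins                   # angular width of one bin
    # Shift by half a bin so that each bin is centered on its direction.
    return int((angle + width / 2) // width) % bins
```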


Therefore, the activity descriptor 100 combines 110 intensity 111, direction 112,
spatial 113, and temporal 114 attributes of the level of activity in the video
sequence 102.

Motion Vector Magnitude

The parameters for motion activity descriptor 100 are derived from the magnitude
of video motion vectors as follows. For an object or frame, an "activity matrix"
$C_{mv}$ is defined as:

$$C_{mv} = \{B(i,j)\}, \quad \text{where} \quad B(i,j) = \sqrt{x_{i,j}^2 + y_{i,j}^2},$$

where $(x_{i,j}, y_{i,j})$ is the motion vector associated with the (i,j)th block B.
For the purpose of extracting the activity descriptor 100 in an MPEG video, the
descriptor for a frame or object is constructed according to the following steps.

Motion Activity Descriptor Extraction

Figure 2 shows a method 200 for extracting activity attributes 100. In step 210,
intra-coded blocks B(i,j) 211 are set to zero. Step 220 determines the average
motion vector magnitude $C_{mv}^{avg}$ 221, or "average motion complexity," over
the blocks B of the frame/object as:

$$C_{mv}^{avg} = \frac{1}{MN} \sum_{i=0}^{M-1} \sum_{j=0}^{N-1} C_{mv}(i,j)$$

M = width in blocks
N = height in blocks

Step 230 determines the variance $\sigma_{fr}^2$ 231 of $C_{mv}$ as:

$$\sigma_{fr}^2 = \frac{1}{MN} \sum_{i=0}^{M-1} \sum_{j=0}^{N-1} \left( C_{mv}(i,j) - C_{mv}^{avg} \right)^2$$

M = width in blocks
N = height in blocks
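Steps 210 through 230 can be sketched in a few lines. The sketch assumes that the
motion vectors have already been parsed from the bit-stream into a list of rows
of (x, y) pairs and that intra-coded blocks are identified by their (row, column)
positions; the function names and this data layout are illustrative choices, not
part of the described method.

```python
import math

def activity_matrix(motion_vectors, intra_coded=()):
    """Build the activity matrix: the magnitude of each block's motion
    vector, with intra-coded blocks set to zero (step 210)."""
    intra = set(intra_coded)
    return [[0.0 if (i, j) in intra else math.hypot(x, y)
             for j, (x, y) in enumerate(row)]
            for i, row in enumerate(motion_vectors)]

def average_and_variance(c_mv):
    """Return the average motion vector magnitude (step 220) and the
    variance of the magnitudes about that average (step 230)."""
    values = [v for row in c_mv for v in row]
    avg = sum(values) / len(values)
    var = sum((v - avg) ** 2 for v in values) / len(values)
    return avg, var
```

The standard deviation used later in the descriptor is the square root of the
returned variance.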

Step 240 determines the "run-length" parameters 241 of the motion vector activity
matrix $C_{mv}$ by using the average as a threshold on the activity matrix as:

$$C_{mv}^{thresh}(i,j) = \begin{cases} C_{mv}(i,j), & \text{if } C_{mv}(i,j) \geq C_{mv}^{avg} \\ 0, & \text{otherwise.} \end{cases}$$

For the purpose of the following description, the zero run-length parameters, in
terms of a raster-scan length, are of particular interest.
We classify zero run-length parameters into three categories: short, medium and
long. The zero run-length parameters are normalized with respect to the
object/frame width. Short zero run-lengths are defined to be 1/3 of the frame width
or less, and medium zero run-lengths are greater than 1/3 of the frame width and
less than 2/3 of the frame width. Long zero run-lengths are equal to or greater than
the width of the frame, i.e., the run-length extends over several raster-scan lines in
a row. For a further description of "zero run-lengths" see U.S. Patent Application
09/236,838 "Methods of Feature Extraction of Video," filed by Divakaran et al. on
January 25, 1999, incorporated herein by reference.

In the notation below, we use the parameter $N_{sr}$ as the number of short zero
run-lengths; the numbers of medium and long zero run-lengths are similarly
denoted by the parameters $N_{mr}$ and $N_{lr}$, respectively. The zero run-length
parameters are quantized to obtain some invariance with respect to rotation,
translation, reflection, and the like.
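The thresholding of step 240 and the counting of short, medium and long zero
run-lengths can be sketched as follows. The function names are hypothetical, the
activity matrix is assumed to be a list of rows scanned in raster order, blocks
equal to the average are assumed to survive the threshold, and the quantization
of the counts mentioned above is omitted.

```python
def zero_run_lengths(c_mv, avg):
    """Threshold the activity matrix at its average, then return the lengths
    of the runs of zeros met in raster-scan order (runs may wrap across rows)."""
    runs, current = [], 0
    for row in c_mv:
        for value in row:
            kept = value if value >= avg else 0.0   # thresholded value
            if kept == 0.0:
                current += 1
            else:
                if current:
                    runs.append(current)
                current = 0
    if current:
        runs.append(current)
    return runs

def classify_runs(runs, width):
    """Count short, medium and long zero runs relative to the frame width."""
    n_sr = sum(1 for r in runs if r <= width / 3)
    n_mr = sum(1 for r in runs if width / 3 < r < 2 * width / 3)
    n_lr = sum(1 for r in runs if r >= width)
    return n_sr, n_mr, n_lr
```

Runs of at least 2/3 of the width but shorter than the full width fall into none
of the three categories as they are defined above, so the sketch leaves them
uncounted rather than guessing at an assignment.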

Therefore, the motion activity descriptor 100 for the frame/object includes:

$$C_{mv}^{avg},\ N_{sr},\ N_{mr},\ N_{lr},\ \sigma_{fr},$$

where $\sigma_{fr}$ is the standard deviation.

Talking Head Identification Method

As shown in Figure 3, we use the MPEG-7 motion activity descriptor 100, as
described above, to identify "talking heads" or "principal cast" members in a
compressed video. Finding the talking head, or more narrowly, the "news-anchor
shots," enables video summarization by establishing beginnings and endings of
news-stories, for example.

First, in a set of one-time-only preprocessing steps, a template motion activity
descriptor (T) 301 is formed. The template can be constructed semi-automatically,
or automatically from representative "training" talking head shots. The latter is
done by extracting 310 motion activity descriptors (MAD) 100 from a training
video 302. The training video can include a large number of shots, for example, ten
to hundreds of typical talking head shots. The training video can include shots
from American, Mexican, Japanese, Chinese, and other news programs showing
the portions of the programs that just include the anchor person or talking head.
The motion activity descriptors 100 are combined 320 to form the template motion
activity descriptor (T) 301. The combining 320 can be a centroid or average of the
motion activity descriptors 100. As an optional step, a weighting or normalizing
factor (W) 330 can be produced according to:

$$W_{tot} = C_{avg}(T) + N_{sr}(T) + N_{mr}(T) + N_{lr}(T)$$
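Forming the template as a centroid and computing the factor W can be sketched as
below. Each descriptor is assumed to be a 4-tuple ordered as (average magnitude,
short, medium, long run counts); that ordering and the function names are
assumptions made for the illustration.

```python
def centroid(descriptors):
    """Average a list of 4-element motion activity descriptors element-wise
    to obtain a single template descriptor."""
    count = len(descriptors)
    return tuple(sum(d[k] for d in descriptors) / count for k in range(4))

def total_weight(template):
    """The factor W: the sum of the four elements of the template."""
    return sum(template)
```

For example, two training descriptors (2.0, 4, 2, 0) and (4.0, 6, 2, 2) give the
template (3.0, 5.0, 2.0, 1.0), whose factor W is 11.0.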

After the template 301 is formed, talking head shots in a video 303 are identified as 
follows. First, the video 303 can be segmented 340 into shots 304 using any known 
segmentation process. If the segmentation is based on compressed DC images, then 
the shot segmentation and the shot identification can be done in a single pass.

Then, motion activity descriptors are extracted 350 from each shot 304. The
motion activity descriptors are combined into a single shot (S) descriptor 305, as
described for the template 301 above. Then, for each shot 304, a distance D(S, T) is
measured 360 according to:

$$D(S,T) = \frac{W_{tot}}{C_{avg}(T)} \left| C_{avg}(T) - C_{avg}(S) \right| + \frac{W_{tot}}{N_{sr}(T)} \left| N_{sr}(T) - N_{sr}(S) \right| + \frac{W_{tot}}{N_{mr}(T)} \left| N_{mr}(T) - N_{mr}(S) \right| + \frac{W_{tot}}{N_{lr}(T)} \left| N_{lr}(T) - N_{lr}(S) \right|,$$

where T is the template motion activity descriptor 301, and S is the shot motion
activity descriptor 305 of the shot which is being tested for a talking head
identification.
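The distance measurement 360 can be sketched under the assumption that D(S, T)
is a weighted sum of absolute element differences, each term scaled by the factor
W divided by the corresponding template element. Descriptors are assumed to be
4-tuples ordered as (average magnitude, short, medium, long run counts), and
skipping a term whose template element is zero is a safeguard added here, not
something the description specifies.

```python
def distance(shot, template):
    """Weighted distance between a shot descriptor and a template descriptor:
    each absolute difference is scaled by W divided by the template element."""
    w_tot = sum(template)
    total = 0.0
    for s, t in zip(shot, template):
        if t == 0:
            continue  # assumption: skip terms that would divide by zero
        total += (w_tot / t) * abs(t - s)
    return total
```

With this scaling, a deviation in an element that is small in the template counts
for more than the same deviation in an element that is large in the template.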

We then apply thresholding 370 on the distance, using, for example, the standard
deviation σ of the template motion activity descriptor, as described above. If the
measured distance is within the standard deviation, then the shot is identified as a
talking head shot 306. Shots identified as talking head shots can be retained for
further processing or indexing, and all other shots can be discarded.

We can take into consideration the fact that talking head shots are homogeneous. In
this case, after identifying a shot as a talking head shot, as per the distance from
one of the templates, we can check its homogeneity as a double check. We check
its homogeneity by determining the difference between the mean of the motion
activity descriptors and the median of the motion activity descriptors. If the
difference exceeds a certain determined threshold, we declare that it is not a talking
head. We get some improvement in the results with this additional test compared to
using the distance from the template(s) alone.
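The thresholding 370 and the homogeneity double check can be combined in one
small decision function. The sketch assumes that the distance to the template has
already been measured and that homogeneity is tested on one scalar value per
frame (for instance the per-frame average motion vector magnitude); the
description does not say which descriptor element is compared, so that choice,
like the function name, is an assumption.

```python
from statistics import mean, median

def classify_shot(distance_to_template, sigma, per_frame_values,
                  homogeneity_threshold):
    """Return True when the shot passes both the distance test and the
    mean-versus-median homogeneity double check."""
    if distance_to_template > sigma:
        return False  # not within the template's standard deviation
    spread = abs(mean(per_frame_values) - median(per_frame_values))
    return spread <= homogeneity_threshold
```

A single outlying frame pulls the mean away from the median, which is why the
difference between the two serves as an inexpensive indicator of an
inhomogeneous shot.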

The basic motion-based talking head identification method according to the
invention is computationally simple and elegant, in contrast with prior art color or
structure based methods. However, the number of false alarms does not fall as one
reduces the size of the shots, as it should. This is probably due to the fact that
the motion activity descriptors are averaged over the shot, and the single template
301 is unable to correctly capture the temporal variation of talking head features
for an entire shot. Therefore, the method of the invention can also use multiple
templates. In this case, the template T 301 becomes a set of templates, and the
distance is measured between the shot motion activity descriptor and the
descriptors of the set of templates. In this case the thresholding can be based on
minimum or maximum distance values.

Gaussian Mixtures

The template or set of templates 301 are formed using discrete functions, e.g., a
vector of four elements. However, the low dimension vectors of the templates can
also be formed, during the one time preprocessing, using continuous functions, for
example, a probability density. In this case, a Gaussian mixture model (GMM) 307
that best fits the training video 302 is first trained. As an advantage, the GMM
forms smooth approximations to arbitrarily shaped densities, and captures "fuzzy"
or probabilistic features of the training video 302.

We can then use well known maximum likelihood (ML) estimation to update the
model parameters, i.e., the mean, variance and mixture weight, which maximize
the likelihood of the GMM, given the training video 302. Depending on the
number of templates desired for the identification method, we can select the means
of component Gaussians as the set of templates 301, in a decreasing order of
mixture weights.
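Selecting templates from a fitted mixture can be sketched as below. The sketch
assumes that the GMM has already been trained elsewhere and that its mixture
weights and component means are available as two parallel lists; the function
name and the example values in the usage note are hypothetical.

```python
def templates_from_mixture(weights, means, count):
    """Return the `count` component means with the largest mixture weights,
    ordered from the heaviest component down."""
    ranked = sorted(zip(weights, means), key=lambda pair: pair[0], reverse=True)
    return [m for _, m in ranked[:count]]
```

For instance, with weights 0.2, 0.5 and 0.3 and a request for two templates, the
means of the second and third components are returned, in that order.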

Distance Metrics 


It is also possible to measure the semi-Hausdorff distance ($d_{sh}$) between the
templates and motion activity descriptor of each frame of a particular shot. The
semi-Hausdorff distance $d_{sh}$ between the motion activity descriptor of a
particular template T 301 and a set of frames in a particular video shot 304 is
defined as follows.

A set of templates 301 includes m elements $T_k$, k = 1, ..., m, and a shot S to be
tested for a "talking head" contains n frames $S_i$, i = 1, ..., n. A distance
between a template $T_k$ and a particular frame $S_i$ is $d(T_k, S_i)$, as defined
above.

The distance $d_i$ for each of the frames $S_i$ is

$$d_i = \min_k d(T_k, S_i), \quad k = 1, \ldots, m,$$

and then the semi-Hausdorff distance between T and S is

$$d_{sh}(T, S) = \max_i d_i, \quad i = 1, \ldots, n.$$


In other words, first, for all i, we measure the distance $d_i$ between each frame
$S_i$ and its best representative in the template set T 301. Next, we determine the
maximum of the distances $d_i$, as above. Thus, we determine how "close" the shot
304 is to the template set T 301. If the representation is better, then the
semi-Hausdorff distance between the frames S and the templates T is lower. For
example, if a shot has a low semi-Hausdorff distance, then this indicates
homogeneity of the shot with respect to the chosen template set.

The performance of the method according to the invention is better when multiple
templates are used instead of just a single template. However, this improvement
comes with an increase in the complexity of finding the semi-Hausdorff
distance between the template set and the frames of the shot. The complexity can
be reduced by sampling 308 the shot and using the sampled subset of frames in
the shot to derive the distances, without substantially reducing the performance of
the method.
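The semi-Hausdorff distance and the optional sampling 308 can be sketched
together. The per-frame distance is passed in as a function so that any of the
distances described above can be used; the function name `semi_hausdorff`, the
`dist` parameter, and the use of regular subsampling with a fixed `step` are
assumptions made for this illustration, since the description does not fix a
sampling scheme.

```python
def semi_hausdorff(templates, frames, dist, step=1):
    """For every `step`-th frame, take the distance to its nearest template,
    then return the largest of those per-frame minima."""
    sampled = frames[::step]   # step=1 keeps every frame
    return max(min(dist(t, s) for t in templates) for s in sampled)
```

With m templates and n frames the full computation needs m times n distance
evaluations; sampling with a step of k reduces this to roughly m times n/k.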

This invention is described using specific terms and examples. It is to be 
understood that various other adaptations and modifications may be made within 
the spirit and scope of the invention. Therefore, it is the object of the appended 
claims to cover all such variations and modifications as come within the true spirit 
and scope of the invention. 